Linear Regression

Let’s take a look at the relationship between basic sanitation access and life expectancy for almost all countries in the year 2012. We measure basic sanitation access by the percentage of the population that has access to basic sanitation services, such as sewage systems, latrines, or composting toilets. Life expectancy measures the predicted number of years a newborn child would live, assuming current mortality patterns were to remain the same.

We’ll assume that there is a linear relationship between the two variables and conduct linear regression, where life expectancy will be our response variable and sanitation rate our explanatory variable.

Our data was pulled from Gapminder, an independent educational non-profit company.

Preview of Our Data:

Here is what the relationship looks like between the two variables we chose to analyze:

The relationship between life expectancy and basic sanitation rate appears to be positive and linear.

Our dataset also includes the year of each observation. Notice how the regression line changes over time:

From this animated graph, it appears that the relationship becomes less positive over time, implying that in later years each additional percentage point of the population with access to basic sanitation increases a country’s predicted life expectancy by a smaller amount.

We need to assume independence between observations for linear regression, so we’re only going to be looking at observations in the year 2012 to remove the time dependency.

Here are some of the features of our model:

term         estimate   std.error  statistic  p.value
(Intercept)  54.891426  0.9045093  60.68641   0
Sanitation    0.224345  0.0114075  19.66652   0

For 2012, we have a y-intercept \(\hat{\beta_0}\) of 54.9, which means the predicted life expectancy is 54.9 years when no one has access to basic sanitation services.

Our slope estimate \(\hat{\beta_1}\) is 0.22, which means we expect the average life expectancy to increase by 0.22 years for each one-percentage-point increase in the share of the population that has access to basic sanitation services.
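As a rough illustration of where the estimate, std.error, and statistic columns come from, here is a minimal sketch of closed-form simple linear regression. It runs on hypothetical stand-in data (the names `sanitation`, `life_exp`, and `ols_summary` are assumptions for this example, not the Gapminder data or the original analysis code), so the printed numbers will not match the table above. The point is the relationship between the columns: each statistic is the estimate divided by its standard error.

```python
import numpy as np

def ols_summary(x, y):
    """Closed-form simple linear regression: (estimate, std.error, statistic) per term."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    # Residual variance estimate with n - 2 degrees of freedom
    sigma2 = np.sum(residuals ** 2) / (n - 2)
    se_slope = np.sqrt(sigma2 / sxx)
    se_intercept = np.sqrt(sigma2 * (1.0 / n + x.mean() ** 2 / sxx))
    return {
        "(Intercept)": (intercept, se_intercept, intercept / se_intercept),
        "Sanitation": (slope, se_slope, slope / se_slope),
    }

# Hypothetical stand-in data, NOT the Gapminder values
rng = np.random.default_rng(0)
sanitation = rng.uniform(10, 100, size=180)
life_exp = 55 + 0.22 * sanitation + rng.normal(0, 4.5, size=180)

summary = ols_summary(sanitation, life_exp)
for term, (est, se, stat) in summary.items():
    print(f"{term:12s} estimate={est:.4f} std.error={se:.4f} statistic={stat:.2f}")
```

The same ratio holds in the table above: 0.224345 / 0.0114075 is approximately 19.67.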

Our linear regression line uses the formula:

\[\hat{y} = \hat{\beta_0} + \hat{\beta_1} \cdot x\]

where \(\hat{y}\) is the predicted life expectancy and \(x\) is the sanitation rate of a particular country in 2012.

And our observed response variables exactly follow the formula:

\[observed~life~expectancy = \hat{\beta_0} + \hat{\beta_1} \cdot sanitation~rate + noise\]

where noise, also called the residual, is the difference between the observed and predicted life expectancy. Assuming a linear regression model is a good fit, we would expect the noise to be normally distributed around 0.

Using the linear formula, our model for the relationship between a country’s life expectancy and the percent of its population with access to basic sanitation services is:

\[predicted~life~expectancy = 54.9 + .22 \cdot sanitation~rate\]
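To make the fitted equation concrete, the following sketch evaluates it at a few sanitation rates. It uses the unrounded coefficients from the table above; the function name `predict_life_expectancy` and the range check are assumptions added for illustration.

```python
# Coefficients taken from the regression table (2012 model)
INTERCEPT = 54.891426
SLOPE = 0.224345

def predict_life_expectancy(sanitation_rate):
    """Predicted life expectancy (years) for a sanitation rate given in percent (0-100)."""
    if not 0 <= sanitation_rate <= 100:
        raise ValueError("sanitation_rate must be a percentage between 0 and 100")
    return INTERCEPT + SLOPE * sanitation_rate

for rate in (0, 50, 100):
    print(f"sanitation rate {rate:3d}% -> predicted life expectancy "
          f"{predict_life_expectancy(rate):.1f} years")
```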

Now let’s figure out whether our linear regression model is an accurate model to use. We’ll measure how much variability in life expectancy is accounted for by our regression.

Response Variance   Fitted Variance   Residual Variance   Proportion of Variability Explained
62.42396            41.86029          20.56367            0.6705805

The proportion of variability explained is 0.67, which follows from the formula

\[proportion~of~variability = R^2 = \frac{Fitted~Variance}{Response~Variance}\]

This indicates that 67% of the variability in life expectancy can be explained by the percentage of the population using basic sanitation. In other words, approximately two-thirds of the variability in life expectancy is accounted for by our regression model. The unexplained variability can be attributed to country-specific factors such as gross domestic product, government, accessibility to healthcare, and other socioeconomic influences.
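The arithmetic behind the proportion can be checked directly from the three variances reported in the table. This short sketch (variable names are illustrative) also shows two related facts: the fitted and residual variances add up to the response variance, and \(R^2\) can equivalently be written as one minus the unexplained share.

```python
# Variances copied from the table above
response_variance = 62.42396
fitted_variance = 41.86029
residual_variance = 20.56367

# Proportion of variability explained
r_squared = fitted_variance / response_variance

# Equivalent form: one minus the share left in the residuals
r_squared_alt = 1 - residual_variance / response_variance

print(f"R^2 = fitted / response          = {r_squared:.4f}")
print(f"R^2 = 1 - residual / response    = {r_squared_alt:.4f}")
print(f"fitted + residual variance       = {fitted_variance + residual_variance:.5f}")
```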

Model Simulation

Notice that we have a formula for observed life expectancy, which is our linear regression line with some added noise. We should be able to simulate or recreate our observed values by adding normally distributed noise around our regression line. This should be fairly accurate, as normally distributed noise, or residuals, with mean 0 is one of the conditions that a linear model is a good fit!
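A minimal sketch of one such simulation is shown below. The coefficients and residual variance come from the tables above, but the sanitation rates are hypothetical stand-ins drawn at random (the Gapminder values are not reproduced here), and the choice of 180 countries is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2012)

# From the regression and variance tables above
INTERCEPT, SLOPE = 54.891426, 0.224345
RESIDUAL_VARIANCE = 20.56367

# Hypothetical sanitation rates (percent); stand-ins for the real country data
sanitation = rng.uniform(5, 100, size=180)

# Points on the regression line
fitted = INTERCEPT + SLOPE * sanitation

# Normally distributed noise with mean 0 and the residual standard deviation
noise = rng.normal(0.0, np.sqrt(RESIDUAL_VARIANCE), size=fitted.size)

# Simulated life expectancy = regression line + noise
simulated = fitted + noise

print(f"noise mean: {noise.mean():.2f}, noise sd: {noise.std(ddof=1):.2f}")
print(f"first five simulated values: {np.round(simulated[:5], 1)}")
```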

By construction, the linear regression line is the same for both plots. The density and location of the points are roughly the same as well, with some additional outliers in our simulated data.

Each point either represents a country’s observed life expectancy or simulated life expectancy for 2012. Let’s compare the two values for each country.

If the simulated life expectancy matched the observed life expectancy, the points would lie exactly on the red line \(y = x\). It appears that there are about as many overestimates as there are underestimates of life expectancy in 2012. The scatterplot indicates that our simulated values and our observed values are moderately close: the \(R^2\) value between our simulated and observed life expectancies is 0.43. This means that 43% of the variability in the simulated life expectancies can be explained by the observed life expectancies.

Our simulation was created using random noise, so each simulation will produce a different set of points with its own \(R^2\) value, and any single simulation may or may not be a good representation of our observed data. To account for this, we’ll perform numerous simulations and compute the \(R^2\) value for each one.
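The repeated-simulation idea can be sketched as follows. Everything about the data here is an assumption for illustration: the sanitation rates are random stand-ins, the "observed" values are themselves generated from the fitted model rather than taken from Gapminder, and 1000 is an arbitrary number of simulations. The \(R^2\) between two sets of values is computed as their squared correlation.

```python
import numpy as np

rng = np.random.default_rng(42)

# From the regression and variance tables above
INTERCEPT, SLOPE = 54.891426, 0.224345
NOISE_SD = np.sqrt(20.56367)

# Hypothetical stand-ins for the real data
sanitation = rng.uniform(5, 100, size=180)
fitted = INTERCEPT + SLOPE * sanitation
observed = fitted + rng.normal(0.0, NOISE_SD, size=fitted.size)

def r_squared_between(a, b):
    """Squared Pearson correlation between two equally sized arrays."""
    return np.corrcoef(a, b)[0, 1] ** 2

# Each simulation draws fresh noise around the same regression line
n_simulations = 1000
r2_values = np.array([
    r_squared_between(fitted + rng.normal(0.0, NOISE_SD, size=fitted.size), observed)
    for _ in range(n_simulations)
])

print(f"mean R^2 across simulations: {r2_values.mean():.3f}")
print(f"sd of R^2 across simulations: {r2_values.std(ddof=1):.3f}")
```

Summarizing the distribution by its mean and standard deviation makes the result less dependent on any single random draw.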

In this plot, we see that the simulated datasets have \(R^2\) values centered around .45. This means that on average, approximately 45% of the variability in the simulated life expectancies can be explained by the variability of observed life expectancies within the model.
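A short derivation suggests why a value near .45 is what we should expect. It is only a sketch, assuming the simulated noise is independent of the observed residuals, has the same variance, and is uncorrelated with the fitted values. Write the observed values as \(y = \hat{y} + e\) and the simulated values as \(y^* = \hat{y} + e^*\). Under those assumptions the only thing the two share is the fitted value, so

\[Cov(y^*, y) \approx Var(\hat{y}), \qquad Var(y^*) \approx Var(y)\]

and the squared correlation between simulated and observed values is approximately

\[\left(\frac{Var(\hat{y})}{Var(y)}\right)^2 = (R^2)^2 \approx 0.67^2 \approx 0.45\]

In other words, the simulated-versus-observed \(R^2\) is roughly the square of the model’s own \(R^2\), because both the observed and the simulated values carry their own independent noise.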

The standard deviation of the distribution is 0.038, meaning the spread of the distribution is small and the \(R^2\) values vary little from one simulation to the next. Overall, our model generates data that are broadly similar, but not very close, to the observed data.

Conclusion

Overall, our linear regression model explains about 67% of the variability in life expectancy, and data simulated from it have \(R^2\) values centered around 0.45 when compared with the observed data. This is decent considering we are attempting to explain a complex variable such as life expectancy with a single explanatory variable, sanitation rate. However, further attempts to explore life expectancy could include models with multiple explanatory variables, or models other than linear regression.